The method of quasi-optimal observables [hep-ph/0001019] offers a fundamental yet simple and flexible algorithmic
framework for data processing in high energy physics to improve upon the practice of event selection cuts. 



No better preface could be imagined for my presentation of the
method of quasi-optimal observables [1] than the talk we have just
heard [2]: there are obviously a number of HEP data processing
specialists around the globe seeking ways, as Dr. Anipko et al. do,
to extract maximum signal from their event samples. The talk [2] is
also further evidence (not that there is any lack of it) that the
prescriptions which I am going to discuss are unknown to many
physicists (see [3] for a discussion of this; see, however, Notes
added hereafter).

I am going to advocate the method of quasi-optimal observables as a
systematic solution of the problem of optimal HEP measurements,
including, but not limited to, the finding of optimal event
selection cuts. The method:

■ derives from first principles of mathematical statistics;

■ is simple and fundamental (and as such deserves to be known to
every non-mathematical physicist);

■ offers a rather universal guiding principle for doing data
processing in HEP, especially in precision-measurement problems;

■ offers a flexible algorithmic framework that seems to require
neither artificial intelligence nor neural networks (not to mention
ants); in fact, its implementations can be based on available
adaptive algorithms for multidimensional integration (among other
options) and seem otherwise straightforward.

The method came about as a by-product of the theory of jet
definition [4] and is a result of attempts to understand the basics
of HEP data processing under the premise that adding a fancy idea to
a mess (a customary way of doing things in what is known as normal
science [5]) usually creates only more mess (my favorite
illustration of this law is the design of C++; see [6] and refs.
therein).

Let $\{P_i\}_i$ be the events (instances of a random variable $P$)
distributed according to the probability density $\pi(P)$; the
density is assumed to depend on a parameter denoted as $M$. Take any
textbook of mathematical statistics for physicists and recall that
the simplest method of parameter estimation¹ is that of
(generalized) moments, going back to Pearson (1894), where one fixes
a function on events (weight) $f(P)$ and then finds the experimental
estimate for $M$, denoted $M[f]$, by fitting the theoretical mean
value, $\langle f\rangle = \int dP\,\pi(P)\,f(P)$, against its
experimental analog
$\langle f\rangle_{\mathrm{exp}} = N^{-1}\sum_i f(P_i)$. Denote as
$\mathrm{Var}\,M[f]$ the variance of the estimate; then the actual
error from the fit is estimated as
$\sqrt{N^{-1}\,\mathrm{Var}\,M[f]}$. The method is simple but is
considered inferior, to the extent that it is no longer mentioned in
the PDG booklets. The problem is that there has been no prescription
for choosing $f(P)$ sensibly, i.e. so as to minimize
$\mathrm{Var}\,M[f]$.
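
The estimation procedure just described can be sketched numerically.
What follows is a minimal illustration, not taken from the paper: it
assumes a toy density $\pi(P) = M e^{-MP}$ on $P \ge 0$, and the
names `M_TRUE`, `N_EVENTS`, `N_EXPERIMENTS` as well as both weights
are illustrative choices. For $f(P) = P$ the theoretical mean is
$1/M$, and for $f(P) = P^2$ it is $2/M^2$, so the fit reduces to
inverting these relations. Repeating the fit over many
pseudo-experiments shows that the spread of $M[f]$ depends on the
choice of $f$.

```python
import numpy as np

rng = np.random.default_rng(0)
M_TRUE = 2.0          # assumed true parameter of the toy density M*exp(-M*P)
N_EVENTS = 2000       # events per pseudo-experiment
N_EXPERIMENTS = 2000  # number of pseudo-experiments

# Each row is one pseudo-experiment drawn from the toy density.
events = rng.exponential(scale=1.0 / M_TRUE, size=(N_EXPERIMENTS, N_EVENTS))

# Weight f(P) = P: theoretical mean <f> = 1/M, so M[f] = 1 / <f>_exp.
m_from_p = 1.0 / events.mean(axis=1)

# Weight f(P) = P^2: theoretical mean <f> = 2/M^2, so M[f] = sqrt(2 / <f>_exp).
m_from_p2 = np.sqrt(2.0 / (events**2).mean(axis=1))

# N times the variance of the estimates approximates Var M[f].
var_m_p = N_EVENTS * m_from_p.var()
var_m_p2 = N_EVENTS * m_from_p2.var()

print(f"mean estimate with f = P   : {m_from_p.mean():.4f}")
print(f"mean estimate with f = P^2 : {m_from_p2.mean():.4f}")
print(f"Var M[f], f = P   : {var_m_p:.2f}")
print(f"Var M[f], f = P^2 : {var_m_p2:.2f}")
```

For this toy density a short calculation gives a variance of about
$M^2$ for the first weight and about $1.25\,M^2$ for the second, so
the two weights are not equally good.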

On the other hand, Fisher's (1912) method of maximal likelihood
prescribes estimating $M$ by the value which maximizes the
expression $\sum_i \ln \pi(P_i)$. The method may be difficult to
apply (especially if the probability density is not known
analytically or the number of events is large, as is often the case
in HEP), but it is optimal in the sense that the resulting estimate
$M_{\mathrm{opt}}$ attains the absolute lower bound established by
the fundamental Fisher-Frechet-Rao-Cramer inequality, which for our
purpose can be written as follows:

$$\mathrm{Var}\,M[f] \ge \mathrm{Var}\,M_{\mathrm{opt}}. \tag{1}$$

¹ At this point it is worth emphasizing that parameter estimation is
a more adequate interpretation of HEP data processing than event
selection, which (one tends to forget) is only an auxiliary tool.

But however inferior the method of moments may seem, it represents a
fundamental viewpoint in that the probability distribution can be
equated to the collection of all average values $\langle f\rangle$
(recall the mathematical definition of distributions as linear
functionals on test functions; probabilistic measures are special
cases of distributions). However abstract this may seem, it is at
least an indication (it was for me) that the method of moments may
have a deeper significance than is popularly perceived.

So, let us ask a simple question: given that $\mathrm{Var}\,M[f]$ is
bounded from below, where in the space of $f$ is the minimum
located? A baffling aspect of all this (discussed in [3]) is that
there is little evidence that, in the 100+ years since the
introduction of the method, there have been serious attempts to
determine which $f$ are better than others in this respect, even
though the minimum has been known to exist for 50+ years (see,
however, Notes added hereafter).

Once one has asked the question, it is straightforward to write down
the following criteria for the minimum:

$$\frac{\delta}{\delta f(P)}\,\mathrm{Var}\,M[f] = 0, \qquad
\frac{\delta^2}{\delta f(P)\,\delta f(Q)}\,\mathrm{Var}\,M[f] > 0. \tag{2}$$

Further, in the statistical limit one has

$$\mathrm{Var}\,M[f] = (\mathrm{Var}\,f) \times
\left[\frac{\partial \langle f\rangle}{\partial M}\right]^{-2}. \tag{3}$$

Simple calculations (for details see [1]) yield the solution:

$$f_{\mathrm{opt}}(P) = \frac{\partial \ln \pi(P)}{\partial M} \tag{4}$$

(up to additive and multiplicative constants that may both depend on
$M$) and that for small deviations from optimality

$$\mathrm{Var}\,M[f_{\mathrm{opt}} + \varphi] =
\langle f_{\mathrm{opt}}^2\rangle^{-1}
+ O(\varphi^2)\,\langle f_{\mathrm{opt}}^2\rangle^{-3} + \ldots \tag{5}$$

where $O(\varphi^2)$ is a known non-negative expression, a fact that
is essentially equivalent to the FFRC inequality (1). Note that
$f_{\mathrm{opt}}$ has to be fixed for some value of $M$. If $M_1$
is the value extracted from data using such an $f_{\mathrm{opt}}$,
then $M_1$ can be used to reevaluate $f_{\mathrm{opt}}$, and so on.
The resulting iterative procedure is seen to be essentially
equivalent to the optimization involved in the maximal likelihood
method.
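
The iterative procedure can be illustrated with a small sketch under
stated assumptions; it is not the author's implementation. The toy
density is a unit-width Lorentzian peak,
$\pi(P) \propto 1/(1 + (P - M)^2)$, for which Eq. (4) gives
$f_{\mathrm{opt}}(P) = 2(P - M_0)/(1 + (P - M_0)^2)$ at a reference
value $M_0$. For this particular density the theoretical mean of that
weight is $2d/(d^2 + 4)$ with $d = M - M_0$ (a convolution of two
Lorentzian-type kernels), so the fit can be inverted in closed form.
The names `M_TRUE`, `N_EVENTS`, `f_opt`, and `solve_for_shift` are
hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
M_TRUE = 0.7      # assumed true peak position of the toy density
N_EVENTS = 20000

# Toy events: pi(P) proportional to 1 / (1 + (P - M)^2), unit width.
events = M_TRUE + rng.standard_cauchy(N_EVENTS)

def f_opt(p, m0):
    """Eq. (4) for the toy density: d ln pi / dM evaluated at M = m0."""
    x = p - m0
    return 2.0 * x / (1.0 + x * x)

def solve_for_shift(c):
    """Invert <f_opt(.; m0)>_M = 2d / (d^2 + 4) = c for d = M - m0.

    Takes the root closest to d = 0, written in a numerically stable form.
    """
    return 4.0 * c / (1.0 + np.sqrt(1.0 - 4.0 * c * c))

m = 0.0  # deliberately crude starting value M_0
history = [m]
for _ in range(10):
    c = f_opt(events, m).mean()   # experimental mean of the current weight
    m = m + solve_for_shift(c)    # fitting the theoretical mean gives the next M
    history.append(m)

print("iterates:", np.round(history, 4))
print("mean of f_opt at the final value:", f_opt(events, m).mean())
```

At the fixed point the sample mean of $f_{\mathrm{opt}}$ vanishes,
which is the likelihood equation $\sum_i \partial_M \ln \pi(P_i) = 0$
for this toy density. Note that the plain sample mean of $P$ would be
useless here, since a Lorentzian has no finite mean.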

The beautiful aspect of (5) is that the quadratic nature of the
second term on the r.h.s. (for which an explicit analytical
expression is available [1]) means that $f_{\mathrm{opt}}$ need not
be known exactly: working with an approximation
$f_{\mathrm{q\text{-}opt}}$ to the optimal moment $f_{\mathrm{opt}}$
may be practically sufficient. Such an approximation is called a
quasi-optimal observable. Note that one may be able to construct
satisfactory quasi-optimal observables even in situations where the
method of maximal likelihood is not applicable, e.g. when the
probability density is known only in the form of a MC generator
rather than as an analytical formula.
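
One possible way to build a quasi-optimal observable when only a
generator is available is sketched below. This is an assumption-laden
illustration, not the construction used by the author: `generate` is
a hypothetical stand-in for a real MC generator, the observable is a
binned finite-difference estimate of $\partial_M \ln \pi$ around a
reference value `M0`, and the theoretical mean
$\langle f\rangle(M)$ is also obtained from MC on a grid of $M$
values. Bin edges, sample sizes, and the step `delta` are arbitrary
choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def generate(m, n):
    """Hypothetical MC generator: events peaked at m (stand-in for a real one)."""
    return m + rng.standard_cauchy(n)

EDGES = np.linspace(-10.0, 10.0, 81)  # 80 bins; the outermost bins absorb overflow
N_BINS = len(EDGES) - 1

def bin_index(p):
    return np.clip(np.digitize(p, EDGES) - 1, 0, N_BINS - 1)

def build_f_qopt(m0, delta=0.25, n_mc=400_000):
    """Per-bin finite-difference estimate of d ln pi / dM around m0."""
    lo = np.bincount(bin_index(generate(m0 - delta, n_mc)), minlength=N_BINS)
    hi = np.bincount(bin_index(generate(m0 + delta, n_mc)), minlength=N_BINS)
    return (np.log(hi + 1.0) - np.log(lo + 1.0)) / (2.0 * delta)  # +1 guards empty bins

def calibrate(table, grid, n_mc=200_000):
    """Theoretical mean <f_qopt>(M) on a grid of M values, obtained from MC."""
    return np.array([table[bin_index(generate(m, n_mc))].mean() for m in grid])

M0, M_TRUE = 0.0, 0.2                      # assumed reference and true values
table = build_f_qopt(M0)
grid = np.linspace(M0 - 0.5, M0 + 0.5, 11)
means = calibrate(table, grid)

data = generate(M_TRUE, 20_000)            # stand-in for the experimental sample
f_exp = table[bin_index(data)].mean()

order = np.argsort(means)                  # np.interp needs increasing x values
m_est = np.interp(f_exp, means[order], grid[order])
print(f"estimate of M: {m_est:.3f}")
```

Because the same table is used for the data and for the calibration,
imperfections of the table (binning, MC noise, the finite step) cost
some precision but do not bias the fit, in line with the quadratic
behavior in (5).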

As a typical example, consider measurements of a process mediated by
an intermediate particle (cf. [2]; another example is the top quark
search in the all-jets channel at D0 [7]). Then the probability
distribution is represented as follows (remember that the overall
normalization plays no role in this):

$$\pi(P) = \pi_{\mathrm{bg}}(P) + g^2\,\pi_{\mathrm{signal}}(P). \tag{6}$$



The first unknown parameter is $g$, usually corresponding to the
intermediate particle's coupling. Other parameters (e.g. its mass
$M$) may be hidden within $\pi_{\mathrm{signal}}$.

The optimal observable for the measurement of $g^2$ (effectively, of
the production rate for this channel) is

$$F_{\mathrm{opt}}(P) = g^2\,\frac{\partial \ln \pi(P)}{\partial g^2}
= \frac{g^2\,\pi_{\mathrm{signal}}(P)}
{\pi_{\mathrm{bg}}(P) + g^2\,\pi_{\mathrm{signal}}(P)} \le 1. \tag{7}$$



It is straightforward to generate an approximation for this using a
MC generator. Note that Eq. (7) approaches 1 whenever the signal
exceeds the background appreciably, and approaches 0 in the opposite
case. Should $F_{\mathrm{opt}}$ happen to take only the values 0 and
1, it would be exactly the selection criterion that Dr. Anipko et
al. [2] devised their algorithm to seek.

However, $F_{\mathrm{opt}}$ is a continuous function in practical
situations. This means that the optimal way (i.e. the one ensuring
minimum errors) to measure the production rate for this channel is
to evaluate the sum

$$\sum_i F_{\mathrm{opt}}(P_i) \tag{8}$$

rather than count events selected with any 0/1 approximation to
$F_{\mathrm{opt}}$. We are forced to conclude that any conventional
data processing procedure based on 0/1 selection cuts, no matter
what kind of intelligence it relies upon, involves a loss of
physical information, i.e. yields systematically higher errors
compared with the procedure based on the optimal observable.
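
As a hedged illustration of Eqs. (7) and (8), the sketch below
evaluates the weight on a toy sample. Everything specific is an
assumption made for the example: a flat background of density
`BG_LEVEL` on [0, 10], a Gaussian signal peak at `MASS` with width
`WIDTH`, the value `G2`, and the overall scale of 1000 used to turn
densities into event yields.

```python
import numpy as np

rng = np.random.default_rng(3)
MASS, WIDTH = 5.0, 0.2     # assumed position and width of the signal peak
BG_LEVEL = 0.05            # assumed flat background density on [0, 10]
G2 = 1.0                   # assumed value of g^2

def pi_bg(p):
    return np.full(np.shape(p), BG_LEVEL)

def pi_signal(p):
    return np.exp(-0.5 * ((p - MASS) / WIDTH) ** 2) / (WIDTH * np.sqrt(2.0 * np.pi))

def f_opt_rate(p, g2):
    """Eq. (7): g^2 pi_signal / (pi_bg + g^2 pi_signal)."""
    s = g2 * pi_signal(p)
    return s / (pi_bg(p) + s)

# Toy yields: the background integral is BG_LEVEL * 10, the signal integral is g^2.
n_bg = rng.poisson(1000 * BG_LEVEL * 10.0)
n_sig = rng.poisson(1000 * G2)
events = np.concatenate([rng.uniform(0.0, 10.0, n_bg),
                         rng.normal(MASS, WIDTH, n_sig)])

weights = f_opt_rate(events, G2)
print("weighted sum, Eq. (8):", weights.sum())
print("events passing the 0/1 cut F_opt > 1/2:", int((weights > 0.5).sum()))
```

In this toy setup the expectation of the weighted sum equals the
expected signal yield, since
$\int \pi\,F_{\mathrm{opt}}\,dP = g^2 \int \pi_{\mathrm{signal}}\,dP$.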

How can one estimate the losses? Given that the hypersurface which
separates the 0 and 1 regions in the space of events is mostly
regular, and so has codim = 1, it is sufficient to examine a
one-dimensional situation. So let $P \in [0,1]$ and $\pi(P) = P$.
Compare the fluctuations of two observables,
$f_{\mathrm{cont}}(P) = P$ and
$f_{\mathrm{cut}}(P) =$ IF $P < \tfrac{1}{2}$ THEN 0 ELSE 1 END. The
second one corresponds to the usual cuts, whereas the first one can
be an optimal observable. For the statistical errors in the two
cases, one finds $\sigma_{\mathrm{cut}}/\sigma_{\mathrm{cont}} = 1.6$,
a substantial factor sufficient to transform a $3\sigma$ signal into
a $5\sigma$ discovery (or vice versa)! This is an optimistic case
(or pessimistic, depending on the viewpoint). But it does indicate
the range of possibilities.
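
The quoted factor can be checked with a few lines. The sketch below
rests on two assumptions that are not spelled out in the text: the
cut threshold is 1/2, and the errors being compared are relative
fluctuations (standard deviation divided by mean). Under these
assumptions the ratio is $\sqrt{8/3}$, which matches the quoted 1.6.

```python
import numpy as np

rng = np.random.default_rng(4)

# pi(P) = P on [0, 1], i.e. 2P after normalization; sample via P = sqrt(U).
p = np.sqrt(rng.uniform(0.0, 1.0, 1_000_000))

f_cont = p                          # continuous weight
f_cut = (p >= 0.5).astype(float)    # 0/1 cut at an assumed threshold of 1/2

def relative_fluctuation(f):
    return f.std() / f.mean()

ratio_mc = relative_fluctuation(f_cut) / relative_fluctuation(f_cont)

# Exact moments: <P> = 2/3, Var P = 1/18; cut efficiency 3/4, variance 3/16.
ratio_exact = (np.sqrt(3.0 / 16.0) / 0.75) / (np.sqrt(1.0 / 18.0) / (2.0 / 3.0))

print(f"Monte Carlo: {ratio_mc:.3f}, exact: {ratio_exact:.3f}")
```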

Next, suppose one wishes to measure, say, the mass $M$ of the
intermediate particle which is responsible for the production
channel being studied. The corresponding optimal observable is
represented as follows (the interesting factorized form is new
compared with the earlier discussions):

$$f_{\mathrm{opt},M}(P)
= \frac{g^2\,\partial_M \pi_{\mathrm{signal}}(P)}
{\pi_{\mathrm{bg}}(P) + g^2\,\pi_{\mathrm{signal}}(P)}
= F_{\mathrm{opt}}(P) \times \partial_M \ln \pi_{\mathrm{signal}}(P). \tag{9}$$



Again, the factorized form on the r.h.s. nicely corresponds to the 
conventional procedures: the first factor corresponds to event se- 
lection cuts for this channel and the second factor, to the specific 
measurement with the selected events. 
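
The factorization can be verified numerically on a toy model. The
sketch below is only a consistency check under assumed inputs: a flat
background that does not depend on $M$, a Gaussian signal peak, and
the values `WIDTH`, `BG_LEVEL`, `G2`. It compares the general formula
(4), evaluated by finite differences of $\ln \pi$, with the product
of the two factors.

```python
import numpy as np

WIDTH, BG_LEVEL, G2 = 0.2, 0.05, 1.0   # assumed toy values

def pi_signal(p, m):
    return np.exp(-0.5 * ((p - m) / WIDTH) ** 2) / (WIDTH * np.sqrt(2.0 * np.pi))

def ln_pi(p, m):
    # Eq. (6) with a flat background that does not depend on M
    return np.log(BG_LEVEL + G2 * pi_signal(p, m))

def f_opt_rate(p, m):
    # first factor: the rate observable of Eq. (7)
    s = G2 * pi_signal(p, m)
    return s / (BG_LEVEL + s)

def dlog_signal_dm(p, m):
    # second factor: analytic d ln pi_signal / dM for the Gaussian toy peak
    return (p - m) / WIDTH**2

p = np.linspace(3.0, 7.0, 401)
m, h = 5.0, 1e-5

lhs = (ln_pi(p, m + h) - ln_pi(p, m - h)) / (2.0 * h)  # Eq. (4), by finite differences
rhs = f_opt_rate(p, m) * dlog_signal_dm(p, m)          # factorized form

print("largest difference:", np.max(np.abs(lhs - rhs)))
```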

Another example is as follows. One can construct optimal observables
on events reduced to a few parameters using e.g. a jet algorithm.
Then the quality of the resulting measurement procedure depends on
the reduction (jet) algorithm used, and one could employ the
quantity $\langle f_{\mathrm{opt}}^2\rangle$ that emerges in (5) to
compare different jet algorithms: the larger this number, the
smaller the error of the measurement of $M$, and the better the jet
algorithm. This number can also be used to control the tradeoffs
between quality and speed involved in various optimizations. A
comparison of this kind has been undertaken in [8]. Referring the
reader to [8] for details, I only cite the finding that the optimal
jet definition introduced in [4] proves to be equivalent on this
test to the $k_T$ algorithm [9], with both appreciably better than
the JADE algorithm [10]. Furthermore, a simple optimization makes
the implementation of the optimal jet definition reported in [11]
more than twice as fast as either of the other two algorithms
without noticeable loss in quality.

To conclude, the method of quasi-optimal observables combines:

■ the simplicity of use of the method of moments;

■ the optimal quality of results of maximal likelihood;

■ algorithmic flexibility;

■ a deterministic method to replace and improve upon event selection
procedures based on cuts.

■ It also meshes well with advanced theoretical calculations
(allowing theorists to use arbitrarily singular expressions for
higher order corrections to probability distributions, once
quasi-optimal observables have been agreed upon with
experimentalists);

■ and offers complete analytical control over one's optimizations
etc. (via the quantity $\langle f_{\mathrm{opt}}^2\rangle$).

The next task is to engineer good software implementations of the 
method. 

I thank E. Jankowski for several discussions and A. Czarnecki for
the hospitality at the University of Alberta, Edmonton, where some
of this work was done. This work was supported in part by the NATO
grant PST.CLG.977751 and the Natural Sciences and Engineering
Research Council of Canada.

Notes added 

After the first and second postings of this text, first Dr. A. Soni 
and then Drs. M. Diehl, O. Nachtmann and F. Nigel notified me of 
some relevant earlier publications. I am particularly indebted to 
Dr. Diehl for kindly sending me copies of the papers published in 
the early 90s that are unavailable on-line and in Russian libraries. 
The summary below rectifies some incorrect statements made in 
Notes added in the second posting of this text. 

It turns out that the important special case of linear dependence on
the measured parameter, Eqs. (6), (7), has a rather rich history of
ten years, although it has remained a sort of esoteric knowledge in
the community of specialists in weak and anomalous couplings.

Ref. [12] derived the corresponding optimal observable using 
orthogonality in Hilbert spaces familiar from quantum mechanics. 

An interesting paper [13] dealt with the method of maximal
likelihood in the case of the probability distribution (6) and
pointed out that the likelihood function depends only on the real
variable $\omega = \pi_{\mathrm{signal}}(P)/\pi_{\mathrm{bg}}(P)$;
therefore, all the information on the parameter being estimated is
carried by $\omega$. However, Ref. [13] stopped short of
reformulating the estimation procedure in terms of the method of
moments. As a kind of test of whether or not Ref. [13] crossed the
discovery line, I note the following: whereas it would have been
immediate to make use of the result of [12] in my work on jet
observables [14] (and it would have saved me a lot of trouble then
and later, even if the result is not general), it would hardly be
possible to make a similar use of [13]; also, the problem was
formulated in [12] in such a way that I can easily imagine myself
replacing the derivation of [12] with the more general one [1] to
arrive at the general result (4), but I cannot imagine Ref. [13] as
such a starting point.

An extension to several parameters, and with inclusion of quad- 
ratic terms (within the ideological framework of [12]) was pre- 
sented and systematically studied in [15], including the connection 
with the FFRC inequality [16]. It resulted in a body of work on 
measurement of (anomalous) trilinear couplings (e.g. [17], [18]). 

A generalization similar to that of [15] was achieved in [19]. 

Note that if the dependence on the parameter being estimated is 
essentially non-linear (as would be the case e.g. with masses, de- 
cay widths, etc.), the general formulae (4) and (9) have to be in- 
voked. 

It is rather amusing that the citations of [12], [15], [19] are 
strongly correlated with the keywords "CP violation", "trilinear 
gauge couplings", "anomalous three gauge boson couplings". 
The corresponding community seems to have no appreciable inter- 
section with that specializing in jet algorithms or in the construc- 
tion of cuts using complicated algorithmic machinery (as repre- 
sented e.g. at the ACAT/AIHENP workshops). 

Still, I find it strange that little, if any, effort seems to have been 
expended to decouple the results of refs. [12], [15], [19] from the 
rather special physical problematics and to make them known be- 
yond the community of weak-couplings specialists, as those for- 
mulae deserved. 